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TECHNICAL FIELD 

[0001] The invention relates to computer architectures that support SMD (Single 
Instruction, Multiple Data) instructions. 

BACKGROUND 

[0002] Some computer architectures support execution of SMD (Single Instruction, 
Multiple Data) instructions. Each such instruction performs one operation on multiple 
sets of data. During execution, multiple simultaneous instances of one operation are used 
to process different data. As an example, one instruction might perform several 
independent subtractions or other mathematical operations (but the same operation 
everywhere). 

[0003] The Pentium® 4 microprocessor from Intel Corporation is one example of a 
computer architecture that is capable of executing SIMD instructions. The 
microprocessor utilizes 144 instructions, known as SSE2 (Streaming SIMD Extensions 2) 
instructions, to operate on data stored in registers. The microprocessor has eight 128-bit 
integer registers, each of which can hold multiple sets of data, such as two 64-bit integers, 
four 32-bit integers, eight 16-bit integers, or sixteen 8-bit integers. Each individual SSE2 
instruction performs one operation on the multiple data sets contained in a register, 
possibly with inputs from another register. The SSE2 instructions reduce the overall 
number of instructions used to execute a particular program task, thereby increasing 
overall performance. 

[0004] Today, microprocessors are called upon to perform many multiplication and 
division operations almost instantaneously. Security is one example where computers are 
called upon to perform many rigorous operations. Security functions employ complex 
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cryptographic algorithms that often use exponentiation operations (e.g., the RSA 
algorithm) involving many modular multiplication and operations. It is a continuing goal 
to find ways to reduce computations in complex algorithms and thereby improve 
performance. 

[0005] Montgomery multiplication is a well-known algorithm for modular multiplication 
which avoids division. It achieves this by reducing double-length products from the right 
rather than from the left. 

SUMMARY 

[0006] An architecture and methodology for implementing Montgomery multiplication 
on a computer system that supports SIMD instructions is described. In the described 
implementation, the machine-level operations utilized by Montgomery multiplication are 
performed using SSE2 (Streaming SIMD Extensions 2) instructions. 

BRIEF DESCRIPTION OF THE CONTENTS 
[0007] Fig. 1 illustrates a computer architecture that supports SIMD instructions. 
[0008] Fig. 2 illustrates use of SIMD instructions within Montgomery multiplication 
(which performs modular multiplication), and the placement of data values in registers. 
[0009] Fig. 3 is a flow diagram of a process for performing Montgomery multiplication 
on the computer architecture of Fig. 1. 

[00010] The same reference numbers are used throughout the disclosure to 
reference like elements and features. 



lee ©ha yes p* 509-32*^256 



3 



I0ISOMO2S MSI-164SUS.PAT.APP 



DETAILED DESCRIPTION 
[00011] Montgomery multiplication is a widely used, division-free, algorithm for 
modular multiplication. The following disclosure describes use of two-way SIMD 
(Single Instruction, Multiple Data) instructions within Montgomery multiplication. 
While Montgomery multiplication may be implemented on various computer 
architectures that support SIMD instructions, one particular implementation will be 
described in the context of a computer architecture that supports SIMD instructions 
operating on two independent operands. 

[00012] Computer Architecture 

[00013] Fig. 1 shows a computer architecture 100 that can be configured to 
implement use of two-way SIMD instructions for Montgomery multiplication. The 
computer architecture 100 includes a microprocessor 102 and memory 104. The 
microprocessor 102 has one or more ALUs (Arithmetic Logic Units) 106(1), 106(N) 
to manage mathematical operations, such as adding and multiplying, and logical operators 
like OR, AND, XOR, and so forth, as well as right and left shifts. The microprocessor 
102 further includes one or more registers 108(1), 108(M) to hold intermediate values 
and final results produced by the ALUs 106. 

[00014] One example of a microprocessor 102 is an Intel® IA-32 microprocessor, 
which has two ALUs 106 and eight 128-bit integer registers 108. The registers 108 are 
also known as XMM registers. Compilers from Intel Corporation and Microsoft 
Corporation support user access to these XMM registers without requiring the user to 
write assembly code. 
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[00015] The computer architecture 100 supports SIMD instructions. In our 
example, the Intel® IA-32 microprocessor supports SSE2 (Streaming SIMD Extensions 
2) instructions. The SSE2 instruction set principally enhances audio and video 
compression processes, bringing several enhancements dedicated to boost MPEG 2 
encoding and file encrypting processes. The SSE2 instruction set includes 144 
instructions that support, among other things, 128-bit SIMD integer arithmetic operations. 
The SSE2 instructions are also intended to assist in vectorizing a program. If, for 
example, one is operating on 32-bit data, then each 128-bit XMM register can hold four 
32-bit values. Using SSE2 instructions, these four 32-bit operands can be processed with 
a single instruction, and four 32-bit results obtained 

[00016] A sequence of SIMD instructions 110 is used to implement Montgomery 
multiplication, which is an algorithm for modular multiplication. The instructions are 
shown generally as residing in microprocessor 102, but they may be executed within any 
of ALUs 106(1)-106(N). These instructions direct the processor to perform computations 
involved in the Montgomery multiplication, perhaps during modular exponentiations in 
the cryptographic function. 

[00017] Modular Multiplication and Montgomery Multiplication 
[00018] By way of introduction to modular multiplication and Montgomery 
multiplication, consider a common division problem. Let n x and n 2 be integers with n 2 > 
0. Dividing n\ by n 2 , the Euclidean division algorithm gives a unique integer quotient q 
and remainder r such that: 

n\ = q n 2 + r (0 <r < n 2 ). 
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For discussion purposes, the mathematical expressions "quo" and "rem" are defined as 
follows: 

quo(/ii, w 2 ) = q (the quotient); and 
rem(«i, n 2 ) = r (the remainder). 

[00019] • The mathematical definition of modular multiplication starts with a 
positive integer modulus N, which is assumed to be fixed herein. Given two integers a 
and b, their modular product, modmul(a, b\ is defined to be rem(a&, N). One possible 
implementation of modular multiplication is to first compute the product ab 9 then divide 1 
this product by Nto yield a quotient q and a remainder r so that: 

(la) ab = qN+r (0 <r<N) 

Such an algorithm outputs the remainder r = rem(a&, TV) = modmul(a, b). 
[00020] Observing that integer division was slow on many computer architectures, 
Peter Montgomery, the named inventor in this application, found another algorithm for 
modular multiplication (i.e., Montgomery multiplication) that avoids the divisions and is 
faster when performing extensive computations with a single modulus. For background 
information on the Montgomery algorithm for modular multiplication, the reader is 
directed to the following publications: Peter L. Montgomery, "Modular Multiplication 
without Trial Division", Mathematics of Computation, Volume 44, Number 170, April 
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1985, pp. 519-521; and Donald E. Knuth, "The Art of Computer Programming", Volume 
2, "Seminumerical Algorithms", Third Edition, Springer- Verlag, 1998, exercise 4.3.1-41. 
[00021] Montgomery multiplication requires a fixed integer R that is coprime to 
modulus JV, meaning that the greatest common denominator of the fixed integer R and 
modulus N is one (i.e., gcd(A r , R) = 1). For ease of continuing discussion, it is assumed 
that the modulus N is odd and that the integer R is a power of 2 (except within an 
example which uses R = 10). Given this assumption, there exists an integer AT with NAT 
= 1 (mod R). The value of AT depends only on N (and R), and need be computed only 
once per modulus N. 

[00022] Given integers a and b (and with N 9 R known by context), the Montgomery 
product, montmul(a, 6), is defined as the integer r such that 0 <r' < Af and 

(lb) ab = r r R (mod N). 

[00023] This integer r exists (and is unique) because R is invertible modulo N. A 
formula for Montgomery multiplication follows: 

(Ml) montmul(a, b) = rem( (ab - qN)IR y N), where q = rem(ab N\ R). 

[00024] To verify equation (Ml), define r\ = (ab - qN)IR, where r\ is deemed to be 
an integer. That assertion follows from: 

qN= (ab AT ) N= ab (AT N) = ab (modi?). 
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[00025] 



Next, it is confirmed that f = rem(ri, AO satisfies equation (lb), since: 



r'R = rem(ri, N) R = r x R = ab- qN= ab (mod N). 
The integer r also satisfies 0 <r' <N, by definition of "rem". 

[00026] The right side of Montgomery's equation (Ml) is easy to evaluate on a 
binary computer if R is a power of 2 and if R is large enough to force | ab - qN \ < NR. 
The latter is ensured if 6 <a, b < N < R. 

[00027] To see the usefulness of Montgomery product "montmul", define a 
function "tomont" for integers x, such that tomont(;c) = rem(xi?, TV). Then, 

montmul(tomont(a), tomont(6)) R = tomont(a) tomont(Z>) // Definition of montmul 
= (aR)(bR) = (abR)R = tomont(ab)R (mod N). II Definition of tomont 

[00028] The assumption gcd(Af, R) = 1 allows this congruence to be divided by R, 
leaving: 

montmul(tomont(a), tomont(6)) = tomont(a6) (mod AO. 

[00029] The two sides must be equal since both are in the interval [0, AM] and 
they are congruent modulo N. Knowing ab = modmulfar, b) (mod AO, and that tomont(x) 
= tomont(y) if x =y (mod AO, it is concluded that: 
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montmul(tomont(a), tomont(6)) = tomont(a&) = tomont(modmul(a, b)) . 

[00030] Accordingly, the function "tomont" is a multiplicative homomorphism 
(meaning that images of products are products of images) from ZJNZ to ZJNZ if modmul 
is used to define multiplication on the domain and montmul to define multiplication on 
the image. Here ZJNZ denotes the integers modulo N. Further investigation reveals that, 

tomont(a) + tomont(6) = tomont(a + b) (mod AO 
tomont(a) - tomont(6) = tomont(a-&) (mod AO- 

[00031] Therefore, the function "tomont" is a ring homomorphism (with the image 
of 1 being rem(7?, AO). From, 

tomont(a) = tomont(6) if and only ifa = b (mod AO 

the function "tomont" is a ring isomorphism from ZJNZ to ZJNZ, not only a 
homomorphism. 

[00032] Multiple-Precision Montgomery Multiplication 

[00033] Having introduced Montgomery multiplication, this section illustrates use 
of multiple-precision numbers in Montgomery multiplication. For discussion purposes, 
consider use of modular multiplication within a cryptographic function. The RSA 
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cryptosystem, for example, requires modular exponentiation where the modulus might be 
512 bits, 1024 bits, or longer. Most computer architectures support only 32 x 32 -» 64- 
bit or 64 x 64 — > 128-bit products. When multiplying longer numbers, each operand is 
partitioned into several smaller (typically 32-bit or 64-bit) pieces. 

[00034] Let r = 2 32 or 2 M , depending upon the computer word size. Find a positive 
integer k such that > N (usually one chooses k as small as possible subject to this 
constraint). Seti? = /. 

[00035] Upper case letters, such as A, are used in this discussion to represent 

nonnegative multiple-precision numbers below. Values up to N can be represented by k 
radix-r digits. These digits are written with the same letter followed by an index in 
brackets. If 0 </ <k-\, then A\j] denotes the >th digit of A in radix r (with the 0-th digit 
being the least significant). That is, 

>4 = Io^i A\f]r j 0 <A\j] <r-\ 

Another way to write this is A = (A[k-\] A[k-2] ... ^[0]) r where the final subscript 
denotes the radix r. 

[00036] Given these notations, one algorithm for montmul(yi, B) where 0 <A, B < 
N<R is: 
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r 2 :=0; 

DISCARD := 0; 

for j from 0 to k-\ do 



muh := 

7i := T, + mu/| * B; 

tlow := T x [0]; 

Ti := (7, - tlow)lr ; 



// Temporarily need £+1 digits 
// tlow := rem(7i, r) 
//r,:=quo(r,,r) 



DISCARD := DISCARD + tlow * r J ; // For presentation purposes only 
wj«/ 2 := rem((f/ow - T 2 [0]) * AT [0], r); // So new T 2 will be an integer 
T 2 := (T 2 + mul 2 *N- tlow)lr ; 
end for; 

return rem(ri - T 2 , N); 



[00037] Checking by induction, it is found that, at the end of the each iteration: 



0 <T X <B 
0 <T 2 <N 
0 <DISCARD < r j+l 

T x r j+l + DISCARD = (A\j] A[j-\] . . . A[0]) r * B 
T 2 r y+1 + DISCARD = 0 (mod AO- 



[00038] Two corollaries are: 



-N<T X -T 2 <B 

(T x - T 2 ) r J+i = (4\f\ A\j-l] . . . A[0)) r * B (mod N) . 



[00039] After the last iteration (J = k-\) completes, the results are: 



-N<Tx-T 2 <B<N 
(Tx-T 2 )R = AB (mod N). 



[00040] Therefore, the function montmul(/l, B) will be T x - T 2 (when T x >T 2 ) or T, 
-T 2 +N (when T x <T 2 ). 



11 



101 5031025 MSI-1648US.PA TAPP 



[00041] With minor modifications, the updates to variables T\ and T 2 can be made 
to look very similar. With a small change in the initializations of tlow and mul 2 , the main 
loop becomes: 

for j from 0 to k-\ do 
muh :=A\j] ; 

tlow := rem(ri[0] + mulf B[Q], r); 
mul 2 := rem((f/ow - r 2 [0])* N'[0] , r); 

Ti := (7\ + mw/i * 5 - tf<w)/r ; 
. 7 2 := (r 2 + mul 2 *N- tlow)lr ; 
end for; 

[00042] After this rewrite, the T\ update adds a multiple of B whereas the T 2 update 
adds a multiple of N. The 7i, T 2 , 5, and AT arrays represent multiple-precision numbers < 
N < R = r k , in which all have length £ when the numbers are written in radix r. The 
multipliers mul\ and muh are single precision (i.e., 0 <mul\, mul 2 < r). The instructions 
needed to update T\ are very similar to those needed to update T 2 . Thus, the sequence of 
instructions to update both T\ and T 2 (taking into account that B and N are multiple- 
precision) is as follows: 
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[Update Code 1] 

sum x := T\[Q] + mul\*B[0]; II 0 <sum x <r 2 ~r 

II Use rem(5M/wi, r) = riow to compute 
/ww/2 (formula suppressed here) 

5w/w 2 := T 2 [0] + /ww/ 2 *Af[0]; 

:= qao{sum\, r); II 0 <sum\ <r - 1 

:= quo(sM/W2 5 r); 
for / from 1 to k-\ do 

swwi := sum x +T\ [i] + row/i *£[/]; // 0 <sum x <r 2 - 1 
sum 2 := $11/112 +7*2 [0 + /ww/ 2 *W[i']; 
7*i[i— 1] := rsm(sumu r); 
T 2 [i~l] := rem(5w/w 2 , r); 

sum\ := quo(rami, r); // 0 <sum\ <r - 1 

5WW2 := quo(5w/W2, r); 
end for; 

:=sum\; 
T 2 [k-l] :=sum 2 ; 

Contemporary computer architectures support pipelining and instruction-level 
parallelism, in which an operation related to updating array T\ is followed by an operation 
related to updating array T 2 . To further improve performance, it is desirable to update 
arrays T\ and T 2 together within the loop over 1, using one instruction per pair of updates. 



[00043] SIMP Instructions for Montgomery Multiplication 
[00044] The computer architecture 100 illustrated in Fig. 1 employs SIMD 
instructions 110 (and in our particular example, SSE2 instructions) to update 
simultaneously the arrays T\ and T 2 used in Montgomery multiplication, montmul(^, B). 
This process involves, in part, processing digits of A in a right-to-left manner and 
interleaving results of the multiplications involving B (i.e., mul\*B[i\) and modulus N 
(i.e., mul 2 *N[i]) that are used to update arrays Tiand T 2 . This will be described in more 
detail below. 
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[00045] As noted above, the SSE2 instructions assist in vectorizing a program, 
where multiple data sets can be held in a single register. To illustrate this point, for an 
operation involving 32-bit data, each 128-bit register 108 can hold four 32-bit values. A 
loop can then be written as follows: 



inG2 jc[100], j/[100], z[100]; // Assume int32 datatype occupies 32 bits 
for i from 0 to 99 do 
z[i\ :=x[i]+y[i]; 
end for; 



This loop can be partially unrolled (by the programmer or the compiler) so the generated 
code resembles: 



int32jc[100],j;[100],z[100]; 

for i4 from 0 to 96 by 4 do 
z[/4] :=x[i4]+y[i4]; 
z[/4+l] :=jt[i4+l]+j/[i4+l] 
z[/4+2] :=x[i4+2] + j/[i4+2] 
z[/4+3] :=x[i4+3]+y[i4+3] 

end for; 



[00046] One SSE2 instruction loads four 32-bit data values (4/4+3], x[/4+2], 
jc[i"4+1], *[/4]) from contiguous memory locations into an XMM register. Another SSE2 
instruction loads (y[/4+3], M'4+2], M)'4+l], y[i4]) to a different XMM register. A single 
PADDD instruction forms all four 32-bit sums, as follows: 



(4*4+3] +j/[/4+3], x[/4+2]+^[/4+2], jt[z'4+l] +j/[z4+l], x[i4] +y[i4]). 
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The four sums are then put into contiguous locations of the z array. By processing four 
array elements at a time, the number of loop iterations is reduced by a factor of four (from 
100 to 25). Another effect is that the compiler can reserve four 32-bit registers to hold 
the loop index z4 and the array addresses x, y, z, rather than clutter those registers with 
data values such as x[i4]. 

[00047] In the case of Montgomery multiplication, the SSE2 instructions are 
employed to update arrays T\ and T 2 together. Whereas in the last example, four 32-bit 
integer values are manipulated at once; here, SSE2 instructions are used to manipulate 
two 64-bit values. In the continuing example, assume that the radix r = 2 32 . The 
implementation will be described with reference to Fig. 2, which illustrates example 
registers and SSE2 commands. 

[00048] Looking at the pseudocode [Update Code 1] noted above, the values of 
sum\ and sum 2 are always in the interval [0, r 2 - 1] = [0, 2 m - 1]. One XMM register 
108(1) is used to hold both 64-bit values (sum u sum 2 ). Other XMM registers 108(2), 
108(4), and 108(5) (or memory locations) will hold the following pairs of 64-bit values: 

(mul\ 9 muli) 

(B[i],N[i\) (0<i<k-l) 
(Ti[i],T 2 [i\) (0 <i<k-\) 

[00049] Although the numerical values of mul u mul 2 , B[i], N[i], T } [i], T 2 [i] all fit 
into 32 bits, 64 bits are allocated for each. A constant (r - 1, r - 1) is also employed, 
which is 00000000FFFFFFFF00000000FFFFFFFF in hexadecimal. This constant is held 
in register 108(3). 
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[00050] Once these data layouts are selected, the inner loop of the pseudocode 
[Update Code 1] translates easily. As noted, three of the eight XMM registers 108(1), 
108(2), and 108(3) hold the pairs {sum u sum 2 ), (mul u mul 2 \ and (r -1, r - 1). All three 
registers are initialized before starting the inner loop. The inner loop body loads the pairs 
(£[/], N[i]) and (7i[i], T 2 [i]) into XMM registers 108(4) and 108(5), as represented by the 
two load instructions InstI: Load and Inst2: Load in Fig. 2. 

[00051] One SSE2 instruction "PMULUDQ", which multiplies two 32-bit values 
to produce a 64-bit unsigned product (i.e., 32 x 32 -» 64-bit unsigned product), performs 
both operations mul x *B[i\ and mul 2 *N[i] (i.e., Inst3: PMULUDQ in Fig. 2). Two SSE2 
instructions "PADDQ", which add two 64-bit values, are then used to yield: 

sum\ := sum\ +T\ [i] + mul\*B[i\; 
sum 2 := sum 2 +T 2 [i] + mul 2 *N[i\\ . 

with the updated (sum u sum 2 ) being placed in another XMM register 108(6) (i.e., Inst4: 
PADDQ and Inst5: PADDQ in Fig. 2). A copy operation (i.e., Inst6: Movdqa) makes a 
replica of {sum u sum 2 ). Then, one SSE2 instruction "PAND" (i.e., a 128-bit AND, which 
is the same as two 64-bit ANDs) with (r -1, r - 1)) (i.e., INST7: PAND) and a store 
operation (i.e., Inst8: Store) perform the following functions: 

T\ •= xtm{sumu r) 
TiC* - 1] := rem(5i/W2, r). 
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The PAND destroys one copy of (sum\ 9 sum 2 ) but the other is preserved. Then, an SSE2 
operation "PSRLQ" (i.e., right shift 64-bit data with zero fill) by 32 bits provides the 
following results (i.e., Inst9: PSRLQ): 

sum, := quo(sum u r) 
sum 2 := quo(sum 2 , 6- 

[00052] Excluding instructions for address computations and loop control (which 
use the 32-bit registers, not the XMM registers), the inner loop body needs only two 
loads, one multiply, two adds, one copy, one bitwise AND, one store, one shift - nine 
SSE2 instructions to update one element of array T\ and the corresponding element of 
array T 2 . The Microsoft Visual Studio .NET and Intel® IA-32 compilers have intrinsics 
that allow a user to generate this type of code without using assembly language. 
[00053] Notice that B[i] is processed in order from i = 0 to / = k -1, allowing for 
right-to-left processing of the digits in A, which is the multiplier of array B. Each 
iteration of the loop over A reprocesses all of 5. Also, notice that the multiplications 
mul x *B[i\ can be interleaved with multiplications mul 2 *N[i] as i advances, resulting in a 
rearrangement of the computations. Yet, the process efficiently produces the desired final 
result, as will be illustrated below in more detail. 
[00054] 

[00055] Memory Layout 

[00056] The pseudocode [Update Code 1] shown above assumes B[i] and N[i] are 
in adjacent 64-bit words, since the SSE2 load instructions (Movdqa) require the data 
being loaded to be contiguous in memory. This memory should also be aligned on a 16- 
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byte boundary. But when montmul(^, B) is invoked, the inputs B[0] through B[k-\] will 
typically be in one array of contiguous 32-bit words (aligned on a 4-byte boundary) and 
N[0] through N[k-l] will be in another such array. That is not the memory pattern 
desired. Accordingly, the computer can be implemented to perform an additional loop 
initialization to create the desired array layout. 

[00057] The problem is avoided by interleaving the B and N arrays into one 
temporary array with 128-bit elements (and properly aligned). The copies can be made 
early in montmul. The cost of these copies is negligible when k is large since each copied 
pair of elements will later be accessed k times. 

[00058] Similarly, the elements of arrays T\ and T 2 can be interleaved within an 
array with 128-bit elements. The final rem(ri - 7 2 , AO computation will store its output ' 
in standard form (i.e., adjacent 32-bit words). 

[00059] Operation 

[00060] Fig. 3 shows a process 300 for using SIMD instructions within 
Montgomery multiplication. The process 300 may be performed, for example, by 
computer architecture 100 and will be described with reference to both Figs. 1 and 2. The 
process 300 is illustrated as a series of blocks, where each block represents an operation 
or functionality performed by the computer architecture. The operations or functions may 
be performed in software, firmware, and/or hardware. As such, the blocks may be used to 
represent instructions stored on a medium, that when executed, perform the recited 
functionality. 
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[00061] At block 302, the values used in computing the function montmul(^, B) 
are initialized. These values include integer inputs A, B, temporary arrays T\, T 2 , and 
modulus N. At block 304, temporary arrays 7^ T 2 are set to zero. Values for input B and 
modulus N are interleaved into a vector and a pair of elements from these arrays will fit in 
one of the registers (e.g., register 108(4) in Fig. 2). 

[00062] At block 306 an iterative loop begins to process the digits of input A, from 
right-to-left. At block 308, assuming that array T\ is to be updated with a multiple of 
input B (the multiplier mul\ being a digit from array A), the processor determines what 
multiple mul 2 of modulus N allows the update arrays T\, T 2 to end with the same digit(s). 
At block 310, the multiplication (digit of A) times B (i.e., mul\*B[i] in the pseudocode 
Update Code 1) and multiplication of the determined multiple times modulus N (i.e., 
mul 2 *N[i\ in the pseudocode Update Code 1) are performed. At block 312, the arrays T h 
T 2 are updated. By using two arrays T\ and T 2 , the multiplication AB is interleaved with 
the multiplication qN in the Montgomery multiplication. The loop continues until all 
digits in input A have been processed, as represented by decision block 314. At block 
316, the result T\ - T 2 (mod AO is returned. 

[00063] Illustration of Loop Rearrangements Within Montmul Computation 
[00064] An example of the Montgomery function montmul will now be described 
with realistic, but fictitious data. First, a desired answer to a montmul computation is 
derived, followed by a discussion of how this result is obtained using the technique 
described above. 
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[00065] Letting A = 123 and B = 456, the problem is to compute montmul(123, 
456), with R = 1000 (decimal) and N= 789. By definition, the output will be an integer r' 
such that 0 < r' < 789 and 

123*456= 1000 r' (mod 789). 

Solving N AT = 1 (mod R), an integer AT is found to be 109 (i.e., 789 AT ■ 1 (mod 1000), 
yields AT = 109). If the modulus N = 789 has been used before, the relationship 789 AT = 
1 (mod 1000) can be solved once to obtain AT = 109. To check the result, plug A/' = 109 
into the relationship, 789*109 = 86001, which mod R = 1000 ends in 001. It is noted 
that, if this is the first use of the modulus 789, the extended Euclidean algorithm can be 
used. 

[00066] Next, using the Montgomery multiplication function [Ml] derived above 
(i.e., montmul(y4, B) = rem( (AB - qN)IR, N), where q = rem(AB N', R)), the product AB 
is computed to produce 56,088 (i.e., 123*456 = 56,088). A leading zero is inserted (i.e., 
056088) to form six digits in order to hold the product of two three-digit numbers. At 
modulo R = 1000, the last three digits in the result are 088. To find quotient q, the result 
088 is multiplied by the integer AT = 109 to produce 9592 (i.e., 088 * 109 = 9592). Once 
again, at modulo R = 1000, the bottom three digits of this result are 592 (i.e., q = 592). 
Multiplying the quotient q = 592 with the modulus AT = 789, and subtracting from the 
product of 123*456 (i.e., AB - qN) provides: 

056088 - 592*789 = 056088 - 467088 = -41 1000. 
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By construction, this result ends in 000. Dividing the value -411000 by R (i.e., forming 
the first element in the "rem" function (AB - qN)IR\ the final result is -411. Inserting 
these results back into the Montgomery function montmul yields the final desired answer: 

montmul(123, 456) = rem(-41 1, 789) = 378. 

[00067] With the desired answer as reference, the remaining portion of this section 
uses the example values in the multiple-precision implementation described above using 
SIMD instructions. The multiple-precision version in this exposition (with r = 10 and k = 
3) aims to thrice insert one zero digit at the bottom of the remainder, rather than insert all 
three zeros at once. It does this, in part, by processing digits in input A from right-to-left, 
beginning with the ones' digit and ending with the hundreds' digit. 
[00068] With N' = 109, its low-order digit is N'[0] = 9. The low order digit of the 
product AB, or 056088, is 8. The product 056088 * AT therefore ends in the same digit as 
8*9 = 72. The last digit is a 2, so subtract twice the modulus N (i.e., 789) from the 
product AB (i.e., 056088): 

056088 - 2*789 = 056088 - 1578 = 054510. 

[00069] The next step looks at the tens' digit 1 in the resulting value 054510. The 
product 05451 * AT (after suppressing the trailing zero) ends in the same digit as 1*9 = 9. 
Thus, a decision is made to subtract nine times the modulus N (i.e., 789), with one trailing 
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zero placeholder to ensure treatment of the tens' digit (i.e., 7890), from the result 054510 
yields: 

054510 - 9*7890 = 054510 - 71010 = -016500. 

[00070] Repeating this step one more time for the hundreds' digit 5 in the result, 
the product -0165 * N' (suppressing the two trailing zeros) ends in the same digit as 5*9 
= 45, which has a last digit of 5. Subtracting five times the modulus N (i.e., 789), with 
two trailing zero placeholders (i.e., 78900), from the result -016500 yields: 

-016500 - 5*78900 = -016500 - 394500 = -41 1000. 

[00071] As earlier, the output is montmul(123, 456) = rem(-411, 789) = 378. 
Here, we subtracted 2*789, 9*7890, and 5*78900 from 056088, giving the same outcome 
as when we earlier subtracted 592*789 from 056088. 

[00072] By using two temporary arrays, T x and T 2 , the multiplication AB (i.e., 
123*456 = 056088) can be interleaved with the reduction AB - qN (i.e., 056088 - 
592*789) to obtain the result -411000. Multiples of 456 will be added to array T\ for 
computing the multiplication AB while multiples of 789 will be added to array T 2 for 
computing the reduction of qN. Both T\ and T 2 will remain nonnegative. Initializing T } 
=T 2 = 0, the multiplication AB, or 123*456, begins at the lower digit of 123, or 3: 

T\ = T x + 3*456 = 0 + 1368 = 1368 
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[00073] The present T 2 = 0 ends in a zero. In order for array T 2 to end in an 8 like 
the present last digit of array Tx = 1368, it is determined to multiply the modulus N (i.e., 
789) by the factor of 2 (i.e., 2*9 = 18, which ends in an 8, which when added to 0 equals 

8): 

T 2 = T 2 + 2*789 = 0+1578 = 1578. 

[00074] Discard the trailing 8's digit in both arrays T x and T 2 (i.e., DISCARD = 8 
for the loop invariants), leaving: 

T x = 136 r 2 = 157. 

[00075] The next digit in A = 1 23 (working from the right) is a 2. 

T x = T x + 2*456 = 136 + 912 = 1048. 

The present T 2 = 157 ends in a 7. In order for array T 2 to end in an 8 like the present last 
digit of array Tx = 1048, it is determined to multiply the modulus N (i.e., 789) by the 
factor of 9 (i.e., 9*9 = 81, which ends in a 1, which when added to 7 equals 8): 

T 2 = T 2 + 9*789 = 157 + 7101 = 7258 
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[00076] 



Discarding the trailing 8's digit in both arrays T\ and T 2 (i.e., DISCARD = 



88 for the loop invariants) leaves: 



T\ = 104 



T 2 = 725. 



(00077) 



The last digit in A = \23 (working from the right) is a 1 : 



T x = T x + 1*456 = 104 + 456 = 560. 

The last digit in the present array T 2 = 725 is a 5. In order for array T 2 to end in a 0 like 
the present last digit of array T\ = 560, it is determined to multiply the modulus N (i.e., 
789) by the factor of 5 (i.e., 5*9 = 45, which ends in a 5, which when added to 5 equals 
1 0, which ends in a zero): 

72 = 72+ 5*789 = 725 + 3945 = 4670. 

[00078] Discard the trailing 0's digit in both arrays T\ and T 2 (i.e., DISCARD = 088 
for the loop invariants), leaves: 



T x =56 



T 2 = 467. 
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Observe that T\ = 56 has the leading digits of 123*456 = 56088 whereas T 2 = 467 has the 
leading digits of 592*789 = 467088. The shared bottom digits are in DISCARD = 088. 
The output is: 

montmul(123, 456) = rem(56 - 467, 789) = rem(-41 1, 789) = 378. 

[00079] It is noted that although the computations have been rearranged, the 
process produced the same final result. Six summands were added in a different order: 

003*456= 001368 

020*456= 009120 

100*456= 045600 
-002*789 = -001578 
-090*789 = -071010 
-500*789 = -394500 

-411000 

[00080] With the computer architecture 100 described above, and use of S1MD 
instructions, the arrays T\ and T 2 are updated together. For example, it can simultaneously 
perform both: 

T, = 136 + 2*456=1048 
72 = 157 + 9*789 =7258. 

Start by looking at the bottom (i.e., units) digits of the T\ update: 6 + 2*6 = 18. This 
ends in an 8 whereas 157 ends in a 7, so the multiplier of 789 in the T 2 update will be (8 - 
7) AT[0] = 9 (mod 10). Once this multiplier is determined, the bottom digits of the T 2 
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update are found to be 7 + 9*9 = 88. Discard the (shared) trailing 8's in 18 and 88, 
initializing a carries vector to the high digits (1,8). 

[00081] Proceeding through the T\ and T 2 updates, but doing them together using 
SMD instructions, the next step operates on the tens' digits (subscript 1) of the inputs: 

temp = carries + (r,[l], T 2 [l]) + (2, 9)*(5[1], N[l]) 
= (1,8) + (3, 5) + (2, 9)*(5, 8) 
= (14, 85). 

The two elements of temp are integers from 0 to 99. The bottom digits (4, 5) are saved as 
the new values of (T\[0], T 2 [0]) and the upper digits are placed in the carries vector (i.e., 
carries = (1, 8)), which by chance is the same as the old value of the carries vector. 
[00082] Doing likewise on the hundreds' digits. 

temp = carries + (7, [2], T 2 [2]) + (2, 9)*(5[2], N[2]) 
= (1,8) + (1,1) + (2, 9)*(4, 7) 
= (10, 72). 

[00083] The bottom digits (0, 2) are saved as the new values of (T\[l] 9 T 2 [l]) and 

the upper digits are placed in the carries vector (i.e., carries = (1, 7)). 

[00084] We have exhausted all digits in the input A = 123. The contents of the 

carries vector (i.e., carries = (1, 7)) are stored as the new values of (T\[2] y T 2 [2]). The 

new values (with the final 8's already discarded) are T\ = 104 and T 2 = 725. 

[00085] The inputs to the temp computation are integers from 0 to 9 and the output 

ranges from 0 to 99. For the hardware registers 108, this corresponds to inputs of 0 to 2 32 

- 1 and outputs of 0 to 2 M - 1. When (T } [2] y T 2 [2]) is referenced from memory, the 
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operands are in adjacent locations, as this is how they were stored on a previous iteration. 
More precisely, they are stored as adjacent 64-bit locations, even though their values fit 
into 32 bits. Pairs like (B[2], N[2]) are copied to adjacent locations early, and reused as 
each digit of A = 123 is processed. 

[00086] Example Source Code 

[00087] Example source code for implementing the pseudocode [Update Code 1] 
is provided below. 

[00088] Some macros are referenced within this code but not defined here. 
ptr_roundup_ml28i(tem/?s) takes a temporary array address aligned on a 4-byte boundary, 
and returns whichever of temps, temps+4 9 temps+S y temps+\2 is aligned on a 16-byte 
boundary. Given a 128-bit value x, the SSE2_digit2(x) macro returns the value in bits 63- 
32 of x, and the SSE2_digitO(x) macro returns the value in bits 31-0 of x. 
[00089] digitj, digit Jc, DWORDREG, DWORDREGC are all 32-bit types. The 
trailing "c" or "C" identifies a constant operand. 
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[00090] Some variable correspondences are: 



Text 


modmul_from_right_S SE2 


A,B 


b, a 


N 


modulus 


N'[0] 


minv 


k 


lng 


(B[i),N[i\) 


a_modulus[i] 


(Ti[i] 9 T 2 [i]) 


templ2[i] 


(2 32 -l,2 32 -l) 


mask02 


(mul\, muli) 


mull2 


{sumusumi) 


prod 12 or carry 12 



[00091] The computation of mul 2 , the multiple of N to be added to T 2 > would 
normally resemble: 

mul 2 = (7\[0] + mulx * B[0] - r 2 [0]) * N' (mod r). 

This formulation requires the multiplication mul\ * B[0] to finish before the 
multiplication by AT can begin. The source code instead uses: 

mul 2 = (T } [0] - T 2 [0]) * AT + mul x * (5[0] * AT) (mod r) 
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with B[0] * AT (mod r) precompiled. The revised mul 2 formula allows the two integer 
multiplications to overlap, if the hardware pipelines such computations. 



This is the first of several modmulchx files with algorithms for 
modular multiplication, both FROM_LEFT and FROM_RIGHT. 
All procedures have an argument list 

(digit_tc *a, *b, // Two numbers to multiply 

// 0 <= a, b < modulus 
digit_t *c // Product. May overlap a or b. 

mp_modulus_tc *pmodulo, // Struct with information about modulus 
digit_t * temps) // Temporaries (length 

// pmodulo- >modmul_algorithm_temps) 

FROM_LEFT codes return 

c == (a*b) mod modulus 

FROM_RIGHT codes return the Montgomery product 

c == a*b / RADIX^lng (mod pmodulo- >modulus) . 

where lng = pmodulo- >length. 

This file starts with 

modmul_f rom_l ef t_def aul t 
modmul_f rom_r ight_de f au 1 1 

which work on all architectures and lengths. It also has 

modmul_f rom_right_SSE2 

which is specific to X86 hosts with SSE2 instructions (e.g., Pentium 4). 

*/ 

/****************★********★*★***************************** *********************/ 
static BOOL WINAPI modmul_f rom_left_de fault 
(digitate *a, ^ // IN 

digit_tc *b, // IN 

digit_t *c, // OUT 

mp_modulus_tc *pmodulo, // IN 

digit_t *temps) // TEMPORARIES, at least 2*lng 

/* 

This implements ordinary modular multiplication. 
Given inputs a, b with 0 <= a, b < modulus < RADIX^lng, 
and where lng > 0, we form 



*/ 
{ 



c == (a*b) mod modulus. 

BOOL OK = TRUE; 
DWORDREGC lng = pmodulo- > length; 
digit_t *templ = temps; // Length 2*lng 

assert (pmodulo- >modmul_a Igor it hm_temps == 2*lng) ; 

OK = OK && multiply(a, lng, b, lng, tempi); // Double-length product 

OK = OK SlU divide (tempi, 2* lng, pmodulo- >modulus, lng, 

&pmodulo->lef t_reciprocal_l , digit_NULL, c) ; 

return OK; 
} // modmul_from_left_de fault 
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/***************************************^ 
static BOOL WINAPI modmul_f rom_right_de fault 
(digit_tc *a, // IN 

digit_tc *b, // IN 

digit_t *c, // OUT 

mp_modulus_tc *pmodulo, // IN 

digit t *temps) // TEMPORARIES , at least 2*lnq 

/* 

This implements Montgomery (FROM_RIGHT) multiplication. 
Let lng = pmodulo- >length > 0. 

Given inputs a, b with 0 <= a, b < pmodulo- >modul us < RADIX^lng, we form 

c == a*b / RADIX^lng (mod pmodulo- >modul us) . 

At the start of the loop on i, there exists templow 
(formed by the discarded values of 
LOW_DIGIT(prodl) = LOW_DIGIT(prod2) ) such that 

0 <= tempi, temp2 < modulus 
templ*RADIX^j + templow = b[0:j-l] * a 
temp2*RADIX A j + templow == 0 (mod modulus) 

When j = lng, we exit with c == tempi - temp2 (mod modulus) 

*/ 
{ 

BOOL OK = TRUE; 

DWORDREGC lng = pmodulo- > length; 

digit_t *templ = temps, *temp2 = temps + lng; // Both length lng 
digit_tc *modulus = pmodulo- >modulus ; 
digit_tc minv = pmodulo - >right_reciprocal_l ; 
digit_tc minvaO = minv*a[0]; // mod RADIX 
DWORDREG i , j ; 

digit_t carryl, carry2, mull, mul2; 
dblint_t prodl, prod2 ; 

assert (pmodulo- >modmul_algorithm_temps == 2* lng) ; 
// Case j = 0 of main loop, with tempi = temp2 = 0 beforehand, 
mull = b[0] ; 

mul2 = minvaO*mull; // Modulo RADIX 

carryl = HPRODUU (mull , a[0]); // Top of product 

carry2 = HPRODUU(mul2 , modulus [0] ) ; 

assert (mull*a[0] == mul2*modulus [0] ) ; // mod RADIX 

for (i = 1; i != lng; i++) { 

prodl = MULTIPLY_ADD1 (mull, a[i], carryl); 

prod2 = MULTIPLY_ADD1 (mul2 , modulus [i] , carry2) ; 
tempi [i-1] = LOW^DIGIT (prodl ) ; 
temp2[i-l] = LOW_DIGIT(prod2) ; 
carryl = HIGH_DIGIT (prodl) ; 
carry2 = HIGH_DIGIT (prod2) ; 

tempi [lng-1] = carryl; 
temp2[lng-l] = carry2; 

for (j = 1; j != lng; j++) { 
mull = b[j] ; 

mul2 = minvaO*mull + minv* (tempi [0] - temp2 [0] ) ; // Modulo RADIX 
prodl = MULTIPLY_ADDl(mull, a[0], tempi [0] ) ; 
prod2 = MULTIPLY_ADD1 (mul2, modulus [0] , temp2 [0] ) ; 

// Replace tempi by (tempi + b[j]*a - LOW_DIGIT (prodl) ) /RADIX 

// Replace temp2 by (temp2 + mul2*modulus - LOW_DIGIT (prod2) ) /RADIX 

assert (LOW_DI GIT (prodl) == L0W_DIGIT (prod2) ) ; 

carryl = HIGH_DIGIT (prodl) ; 
carry2 = HIGH_DIGIT (prod2 ) ; 
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for (i = 1; i != lng; i++) { 

prodl = MULTIPLY_ADD2 (mull, a[i], tempi [i] , carryl); 

prod2 = MULTIPLY_ADD2 (mul2, modulus [i], temp2[i], carry2) ; 
tempi [i-l] = LOW_DIGIT( prodl ) ; 
temp2[i-l] = L0W_DIGIT(prod2) ; 
carryl = HIGH_DIGIT (prodl ) ; 
carry2 = HIGH DIGIT (prod2 ) ; 

} 

tempi [lng-1] = carryl; 
temp2[lng-l] = carry2; 

OK = OK && sub_mod( tempi, temp2, c, modulus, lng); 

return OK; 
} // modmul_f rom_right_de fault ; 
/************************************^ 

#if TARGET == TARGET_IX86 && USESSE2 
static BOOL WINAPI modmul_f rom_right_SSE2 
(digitate *a, ~ // IN 

digit_tc *b, // IN 

digit_t *c, // OUT 

mp_modulus_tc *pmodulo, // IN 

digit_t *temps) // TEMPORARIES, at least 8*lng + 3 

/* 

With modmul_f rom_right_de fault (above) , the computations on 
(a, carryl, mull, prodl, tempi) are very similar to those on 
(modulus, carry2, mul2, prod2 , temp2) . 

This module takes advantage thereof, using X86 SSE2 instructions. 

The last code often fetches a[i] and modulus [i] together. 
We construct one temporary array with lng 128-bit words (hence 4*lng digit_t 
elements), whose i-th element contains (0, a[i], 0, modulus[ij). 
Another array of this length has (0 , 1 tempi [i] , 0, temp2[i}). 

*/ 
{ 

BOOL OK = TRUE; 

DWORDREGC lng = pmodulo- > length ; 
digit_tc *modulus = pmodulo->modulus; 
digitate minv = pmodulo- >right_reciprocal__l ; 
digitate minvaO = minv*a[0]; // mod RADIX 

const ml28i mask02 = _mm_set_epi32 ( 0 , RADIXM1, 0, RADIXM1) ; 

^ml28i *templ2 = ptr__roundup_ml28i (temps) ; // Length lng 

ml28i *a_modulus = templ2 + lng; // Length lng 

ml28i carryl2, mull2, prodl2; 

DWORDREG i , j ; 

assert (pmodulo->modmul_algorithm_temps == 8* lng + 3) ; 

for (i = 0; i != lng; i++) { 

a_modulus[i] = _mm_set_epi32 (0 , a[i], 0, modulus[ij); 
templ2(i] = mm setzero sil28(); 

} 

for (j = 0; j != lng; j++) { 
digit_tc mull = b[j] ; 
digitate mul2 = minva0*mull 

+ minv* (SSE2_digit2 (templ2 [0] ) - SSE2_digitO ( templ2 [0] ) ) ; 

// Modulo RADIX 

mull2 = _mm_set_epi32 (0, mull, 0, mul2) ; 

prodl2 = _mm_add_epi64 (templ2 [0] , _mm_mul_epu32 (mull2, a_modulus [0] ) ) ; 

// Replace tempi by (tempi + b[j]*a - LOW_DIGIT (prodl )) /RADIX 

// Replace temp2 by (temp2 + mul2*modulus - LOW_DIGIT (prod2 )) /RADIX 

assert (SSE2_digitO (prodl2) == SSE2_digit2 (prodl2) ) ; 
carryl2 = _mm_srli_epi64 (prodl2 , 32); 

// Upper halves of products: (0, carryl, 0, carry2) 
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for (i = 1; i != lng; i++) { 

prodl2 = __mm_add_epi64 (_mm_add_epi64 (carry!2, templ2 [i] ) , 

_mm_jnul_epu32 (mull2, a_modulus [i] ) ) ; 
templ2[i-l] = _mm_and_sil28 (mask02 , prodl2) ; // Lower halves 
carryl2 = _mm_srli_epi64 (prodl2 , 32); // Upper halves 

templ2 [lng-1] = carryl2; 
} //for j 

II ^ Do the equivalent of sub_same, but with non-contiguous input. 

digit_t borrow = 0; 

for (i = 0; i != lng; i++) { 

digitate ai = SSE2_digit2 ( templ2 [i] ) ; 

digit_tc bi = SSE2_digitO ( terapl2 [i] ) ; 

digit_tc ci = ai - bi - borrow; 

c [i] = ci; 

borrow = ai x ( (ai " bi) | (ai * ci)); // MAJORITY (~ai , bi, ci) 
borrow >>= (RADIX BITS - 1) ; 

} 

if (borrow != 0) borrow add_same(c, modulus, c, lng); 
assert (borrow == 0) ; 

} 

return OK; 
} // modmul_f rom_right_SSE2 
#endif // TARGET_IX86 



[00092] Conclusion 



[00093] Although the invention has been described in language specific to 
structural features and/or methodological acts, it is to be understood that the invention 
defined in the appended claims is not necessarily limited to the specific features or acts 
described. Rather, the specific features and acts are disclosed as exemplary forms of 
implementing the claimed invention. 
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